Age of Respondents

Objective: create a bar chart that shows the age distribution of respondents.

While this does not require complex transformations, the intent of this notebook is to showcase some exploratory data analysis and data cleaning to solve some common issues with numeric data, with a nice visualization at the end.

Notebook setup

Here we import the necessary modules

And load the CSV
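The two steps above can be sketched roughly as follows. The notebook does not name the CSV file, so this sketch reads from an in-memory buffer instead; the sample rows and the "Country" column are assumptions for illustration, while Respondent and Age are the columns used in the rest of the notebook.

```python
import io

import pandas as pd

# Stand-in for the survey CSV. The rows and the "Country" column are
# made up for this sketch; only Respondent and Age come from the notebook.
csv_text = """Respondent,Age,Country
1,25,Portugal
2,31.5,Brazil
3,,India
4,99,Canada
"""

# In the notebook this would be pd.read_csv(<path to the survey CSV>).
data = pd.read_csv(io.StringIO(csv_text))

# Age is read as a float column because it holds a decimal and a blank.
print(data.dtypes)
```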

We can keep only the columns needed for this notebook

Note: we'd need a second column for counting if we were going to use aggregations to calculate the age frequencies.
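A minimal sketch of the column selection, together with the aggregation route the note refers to. The sample values and the "Country" column are hypothetical; the point is only that Respondent serves as the column to count when grouping by Age.

```python
import pandas as pd

# Small stand-in for the loaded survey data; "Country" is a hypothetical extra column.
data = pd.DataFrame(
    {
        "Respondent": [1, 2, 3],
        "Age": [25.0, 31.0, 25.0],
        "Country": ["Portugal", "Brazil", "India"],
    }
)

# Keep only the respondent id and the age.
data = data[["Respondent", "Age"]]

# The aggregation route from the note: count respondent ids per age.
counts_by_age = data.groupby("Age")["Respondent"].count()
print(counts_by_age)
```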

Now that we have the data loaded, the next step is to process it.

To get Age fully ready for visualization, we need to complete four operations: remove outliers, remove decimal ages, remove blanks, and create the age frequencies.

Let's explore the data with a scatter plot to see the ages reported.

Remove outliers

The scatter plot shows respondents who report being close to 100 years old or older, which are unlikely to be accurate data points. Furthermore, there is a threshold below which children are too young to be replying to the survey.

Additionally, Plotly visualizations are interactive so we can freely zoom in and out of the visual to analyse parts in greater detail.

With everything considered, we can filter the data, keeping only respondents who are at least 10 years old and younger than 76. Of course, there may have been respondents who were truly outside this range, but this seems a safe range considering the context.
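The range filter can be sketched with a boolean mask. The sample ages are made up to sit on both sides of each boundary.

```python
import pandas as pd

# Hypothetical ages around the two boundaries.
data = pd.DataFrame(
    {
        "Respondent": [1, 2, 3, 4, 5, 6],
        "Age": [9.0, 10.0, 42.0, 75.0, 76.0, 99.0],
    }
)

# Both conditions must hold, hence & rather than |.
in_range = (data["Age"] >= 10) & (data["Age"] < 76)
data = data[in_range]

print(data["Age"].tolist())  # [10.0, 42.0, 75.0]
```

Note that comparisons against a missing value evaluate to False, so a mask like this also drops rows where Age is blank.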

The age distribution seems cleaner.

There are two areas with slightly different distributions from the rest (before the 40k respondents and after the 60k), which may indicate an interesting pattern to delve deeper into, or may just be a coincidence of the way the data is sorted. However, we won't explore that here.

In the context of the complete dataset, maybe you'd find an interesting insight, such as only a certain age group filling in the survey during a short timeframe, but in this example our only objective is to arrive at that bar chart of ages.

Remove decimal ages

Next we need to remove decimal ages. There are about twenty respondents with decimal ages, e.g. 15.5 years. This data issue can be dealt with using different approaches, like rounding up or down or replacing them with the average age, but we'll take the easy way out and simply remove them.
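A possible way to drop the decimal ages is to keep only rows whose Age has no fractional part. The modulo test below is one option among several and is not necessarily what the notebook used; the sample values are made up.

```python
import pandas as pd

# Hypothetical sample with two decimal ages.
data = pd.DataFrame(
    {
        "Respondent": [1, 2, 3, 4],
        "Age": [15.5, 20.0, 33.0, 41.25],
    }
)

# A whole-number age leaves no remainder when divided by 1.
is_whole = data["Age"] % 1 == 0
print((~is_whole).sum())  # number of decimal ages: 2

data = data[is_whole]
print(data["Age"].tolist())  # [20.0, 33.0]
```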

Remove blanks

As with the decimal ages, blank responses can be resolved in multiple ways. Again, we'll simply remove them in this notebook.

There actually weren't any blank ages, but I wanted to show how to drop rows with blanks under different conditions anyway. The way the function was called, it deletes only rows that have a blank Age; it ignores blanks in the other columns.

The call to dropna has more arguments than needed: rows are already the default axis, "any" is the default how value, and Respondent is always filled in because it is the id of the respondent in the original dataset, so the subset argument isn't needed either.

In other words, the call for this specific example could be just data.dropna(), but this way I was able to show the arguments that make this function more versatile.
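The verbose call and its shorter equivalent might look like the sketch below. The exact call from the notebook is not shown in the text, so the argument values are reconstructed from the description; axis="index" is used here as the documented name for the row axis, and the sample data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one blank age.
data = pd.DataFrame(
    {
        "Respondent": [1, 2, 3],
        "Age": [25.0, np.nan, 40.0],
    }
)

# Explicit form: drop rows (axis) where any (how) of the listed columns (subset) is blank.
explicit = data.dropna(axis="index", how="any", subset=["Age"])

# Short form: relies on the defaults and checks every column.
short = data.dropna()

# With Respondent always filled in, both calls keep the same rows.
print(explicit.equals(short))  # True
```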

Create age frequencies

The last step we need before creating the bar chart is to count the frequency of each age, i.e., the values for the Y axis of the chart.

value_counts returns a Series with the ages as the index and their frequencies as the values, sorted from most to least frequent.
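A short illustration of that behaviour on made-up data. The variable name age_counts is an assumption.

```python
import pandas as pd

# Hypothetical cleaned ages: 30 appears three times, 25 twice, 41 once.
data = pd.DataFrame(
    {
        "Respondent": [1, 2, 3, 4, 5, 6],
        "Age": [30, 25, 30, 41, 25, 30],
    }
)

# Ages become the index, frequencies the values, most frequent first.
age_counts = data["Age"].value_counts()

print(age_counts.index.tolist())  # [30, 25, 41]
print(age_counts.tolist())        # [3, 2, 1]
```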

Create the bar chart

Passing the Series of frequencies to the bar function is enough to create the bar chart.

Plotly sorts the ages (X axis) automatically.